[Figure: two panels. Left, "Sampling operations" using the anti-bandit LCB; right, "Reducing the search space" using the anti-bandit UCB. Candidate operations on the edge between nodes $B_i$ and $B_j$ include CONV 3x3, CONV 5x5, MAX POOL 3x3, Identity, and Depth-Wise CONV 3x3.]
FIGURE 4.1

ABanditNAS is divided into two steps: sampling using LCB and abandoning using UCB.

they are confirmed to be bad. Meanwhile, weight-free operations are compared with parameterized operations only once the latter are well trained. On the other hand, the operation pruning process makes the search space progressively smaller, leading to an efficient search process.

4.2.1 Anti-Bandit Algorithm

Our goal is to search for network architectures effectively and efficiently. However, NAS faces a dilemma: whether to keep a network structure that already offers significant rewards (exploitation) or to investigate other network structures further (exploration). Rooted in probability theory, the multi-armed bandit addresses this exploration-versus-exploitation dilemma by making decisions among competing choices so as to maximize the expected gain. Specifically, we propose an anti-bandit that chooses and discards the arm $k$ in a trial based on

$\tilde{r}_k - \tilde{\delta}_k \leq r_k \leq \tilde{r}_k + \tilde{\delta}_k$,
(4.1)

where $r_k$, $\tilde{r}_k$, and $\tilde{\delta}_k$ are the true reward, the average reward, and the estimated variance obtained from arm $k$. $\tilde{r}_k$ is the value term, which favors actions that have historically performed well, and $\tilde{\delta}_k$ is the exploration term, which gives actions an exploration bonus. $\tilde{r}_k - \tilde{\delta}_k$ and $\tilde{r}_k + \tilde{\delta}_k$ can be interpreted as the lower and upper bounds of a confidence interval.
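As a concrete illustration with hypothetical numbers, suppose arm $k$ has an average reward of $\tilde{r}_k = 0.6$ and an estimated variance term of $\tilde{\delta}_k = 0.1$. Then (4.1) reads $0.5 = \tilde{r}_k - \tilde{\delta}_k \leq r_k \leq \tilde{r}_k + \tilde{\delta}_k = 0.7$: the lower bound $0.5$ is the LCB used for sampling, and the upper bound $0.7$ is the UCB used for abandoning.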

The traditional UCB algorithm, which optimistically substitutes $\tilde{r}_k + \tilde{\delta}_k$ for $r_k$, emphasizes exploration but ignores exploitation. Unlike the UCB bandit, we further exploit both the LCB and the UCB to balance exploration and exploitation. An arm with a smaller LCB usually has a low expected reward but significant variance, so it should be given a larger chance of being sampled for more trials. Then, based on the observation that the operations that are worst in the early stage usually still perform worse at the end [291], we use the UCB to prune the operation with the worst performance and reduce the search space. In summary, we adopt the LCB, $\tilde{r}_k - \tilde{\delta}_k$, to sample the arm that should be further optimized, and use the UCB, $\tilde{r}_k + \tilde{\delta}_k$, to abandon the operation with the minimum value. Because the variance is bounded and converges, the estimated value of an operation is always close to its true value and gradually approaches it as the number of trials increases. Our anti-bandit algorithm overcomes the limitations of an exploration-based strategy, including levels of understanding and suboptimality gaps. The definitions of the value term and the variance term, together with the proof of our proposed method, are given below.
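To make the procedure concrete, the following is a minimal Python sketch of the anti-bandit loop described above, not the authors' implementation: arms with a smaller LCB ($\tilde{r}_k - \tilde{\delta}_k$) are chosen for further trials, and the arm with the smallest UCB ($\tilde{r}_k + \tilde{\delta}_k$) is periodically abandoned. The class and operation names, the concrete exploration term, the random placeholder reward, and the pruning schedule are illustrative assumptions; the sketch also picks the smallest-LCB arm deterministically, whereas the text only asks that such arms receive a larger sampling probability.

import math
import random

class AntiBandit:
    """Minimal anti-bandit sketch: sample by LCB, abandon by UCB (illustrative only)."""

    def __init__(self, arms):
        self.arms = list(arms)                      # remaining candidate operations
        self.rewards = {a: [] for a in self.arms}   # observed rewards per arm

    def _value(self, arm):
        # Value term: average reward of the arm (Definition 1).
        r = self.rewards[arm]
        return sum(r) / len(r) if r else 0.0

    def _delta(self, arm, total):
        # Exploration term (assumed form): grows with the total trial count,
        # shrinks as the arm is tried more often.
        n_k = max(len(self.rewards[arm]), 1)
        return math.sqrt(2.0 * math.log(max(total, 2)) / n_k)

    def sample(self):
        # Choose the arm with the smallest LCB = value - delta for the next trial.
        total = sum(len(r) for r in self.rewards.values())
        return min(self.arms, key=lambda a: self._value(a) - self._delta(a, total))

    def update(self, arm, reward):
        self.rewards[arm].append(reward)

    def abandon_worst(self):
        # Remove the arm with the smallest UCB = value + delta from the search space.
        total = sum(len(r) for r in self.rewards.values())
        worst = min(self.arms, key=lambda a: self._value(a) + self._delta(a, total))
        self.arms.remove(worst)
        return worst

bandit = AntiBandit(["conv_3x3", "conv_5x5", "max_pool_3x3", "identity", "dw_conv_3x3"])
for t in range(40):
    arm = bandit.sample()
    bandit.update(arm, random.random())       # placeholder reward, e.g. validation accuracy
    if (t + 1) % 10 == 0 and len(bandit.arms) > 1:
        bandit.abandon_worst()
print(bandit.arms)                             # operations that survive the pruning

The deterministic minimum over the LCB keeps the sketch short; a probabilistic choice weighted by the LCB fits the description above equally well.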

Definition 1. If an operation on arm $k$ has been recommended $n_k$ times and $\mathrm{reward}_i$ denotes the reward obtained from arm $k$ on the $i$-th of these trials, the value term of the anti-bandit is defined as follows

$\tilde{r}_k = \dfrac{\sum_{i=1}^{n_k} \mathrm{reward}_i}{n_k}$.
(4.2)
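In an implementation, the value term in (4.2) need not be recomputed from the full reward history; it can be maintained as a running mean. The small Python sketch below uses illustrative variable names and is equivalent to (4.2):

def update_value(r_tilde_k, n_k, reward):
    # Incremental form of (4.2): after a new reward on arm k, the returned
    # average equals sum_i reward_i / n_k over all n_k trials so far.
    n_k += 1
    r_tilde_k += (reward - r_tilde_k) / n_k
    return r_tilde_k, n_k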